{r global_options, include=FALSE} knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/', echo=FALSE, warning=FALSE, message=FALSE)

Introduction

I decided to pull some data from the City of Toronto because the city has begun to open up its data and I wanted to take a look at it. I chose this dataset because it is large and offers room for a good amount of analysis.

The dataset covers a two-year period, starting in January 2014 and ending in January 2016. It contains the records of every inspection completed in the city.

Part of my objective was to see whether there were location patterns or categorical patterns among restaurants that received a conditional pass or were closed for a short period of time.

Below is what I found.

Load all libraries

This is where I loaded all the libraries.
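The loading chunk itself is hidden by `echo=FALSE`. A minimal sketch of the packages this analysis appears to rely on (the exact set is my assumption, inferred from the functions used below) would be:

```r
# Packages assumed from the analysis below: plotting, data wrangling,
# geocoding and map tiles, dbscan clustering, and arranging plot grids
library(ggplot2)    # plots
library(dplyr)      # data wrangling
library(ggmap)      # geocode() and get_map()/ggmap()
library(fpc)        # dbscan()
library(gridExtra)  # grid.arrange()
```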

Load Files and wrangle data

This section just loads the starting data. I made a new dataset, dine_merge, which includes geocoded locations for all of the addresses.
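The loading code is hidden, but given the structure shown below it was presumably something along these lines (the file name is a guess):

```r
# Read the raw DineSafe export; the file name is an assumption
dine <- read.csv("dinesafe.csv")

# Parse the inspection date so min()/max() and month extraction work
dine$INSPECTION_DATE <- as.Date(dine$INSPECTION_DATE)
```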

Basic Information about dataset

Here is just some basic information about the dataset. This helped me get a feel for what I could do with it.

# Structure of the main dataset
str(dine)
## 'data.frame':    93671 obs. of  16 variables:
##  $ ESTABLISHMENT_ADDRESS      : Factor w/ 11006 levels "","1 ADELAIDE ST E",..: 1 2 2 2 2 2 2 2 2 2 ...
##  $ ROW_ID                     : int  93671 21135 51116 49665 21137 21136 49664 49668 49666 49663 ...
##  $ ESTABLISHMENT_ID           : int  10552910 9337616 10390332 10384957 9337616 9337616 10384957 10384957 10384957 10384957 ...
##  $ INSPECTION_ID              : int  103653136 103216626 103415774 103375392 103646914 103394329 103292594 103563396 103393348 103179737 ...
##  $ ESTABLISHMENT_NAME         : Factor w/ 12406 levels "'K' STORE","(FAMOUS PLAYERS )CINEPLEX ENTERTAINMENT",..: 2336 9842 4945 11434 9842 9842 11434 11434 11434 11434 ...
##  $ ESTABLISHMENTTYPE          : Factor w/ 51 levels "","Bake Shop",..: 1 31 30 27 31 31 27 27 27 27 ...
##  $ ESTABLISHMENT_STATUS       : Factor w/ 4 levels "","Closed","Conditional Pass",..: 1 4 4 4 4 4 4 4 4 4 ...
##  $ MINIMUM_INSPECTIONS_PERYEAR: int  NA 1 1 3 1 1 3 3 3 3 ...
##  $ INFRACTION_DETAILS         : Factor w/ 325 levels "","Altering floor space in facility without inspector's approval O. Reg  562/90 Sec. 69",..: 1 1 1 1 1 1 239 239 1 1 ...
##  $ INSPECTION_DATE            : Date, format: NA "2014-04-15" ...
##  $ SEVERITY                   : Factor w/ 5 levels "","C - Crucial",..: 1 1 1 1 1 1 3 3 1 1 ...
##  $ ACTION                     : Factor w/ 12 levels "","Closure Order",..: 1 1 1 1 1 1 6 6 1 1 ...
##  $ COURT_OUTCOME              : Factor w/ 10 levels "","Cancelled",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ AMOUNT_FINED               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ geocoded.lat               : num  43.7 43.7 43.7 43.7 43.7 ...
##  $ geocoded.long              : num  -79.4 -79.4 -79.4 -79.4 -79.4 ...
# Names of variables
names(dine)
##  [1] "ESTABLISHMENT_ADDRESS"       "ROW_ID"                     
##  [3] "ESTABLISHMENT_ID"            "INSPECTION_ID"              
##  [5] "ESTABLISHMENT_NAME"          "ESTABLISHMENTTYPE"          
##  [7] "ESTABLISHMENT_STATUS"        "MINIMUM_INSPECTIONS_PERYEAR"
##  [9] "INFRACTION_DETAILS"          "INSPECTION_DATE"            
## [11] "SEVERITY"                    "ACTION"                     
## [13] "COURT_OUTCOME"               "AMOUNT_FINED"               
## [15] "geocoded.lat"                "geocoded.long"
# The date range covered by the data
max(dine$INSPECTION_DATE, na.rm=TRUE)
## [1] "2016-01-15"
min(dine$INSPECTION_DATE, na.rm = TRUE)
## [1] "2014-01-16"
# Table of the different types of severity and their counts
table(dine$SEVERITY)
## 
##                             C - Crucial           M - Minor 
##               33471                2344               32745 
## NA - Not Applicable     S - Significant 
##                3987               21124
# Table of the different types of Status and their counts 
table(dine$ESTABLISHMENT_STATUS)
## 
##                            Closed Conditional Pass             Pass 
##                1              340            17384            75946
# The most common infraction details, and the number of distinct infractions
head(sort(table(dine$INFRACTION_DETAILS), decreasing = TRUE))
## 
##                                                        
##                                                  33471 
##       Operator fail to properly wash surfaces in rooms 
##                                                   9092 
##               Operator fail to properly maintain rooms 
##                                                   7799 
##               Operator fail to properly wash equipment 
##                                                   6571 
## Operator fail to properly maintain equipment(NON-FOOD) 
##                                                   2910 
##              Operator fail to provide proper equipment 
##                                                   2246
length(unique(dine$INFRACTION_DETAILS))
## [1] 325
#Unique Addresses 
length(dine$ESTABLISHMENT_ADDRESS)
## [1] 93671
length(unique(dine$ESTABLISHMENT_ADDRESS))
## [1] 11006

Graphs of how different Establishment types get different infractions

I first did a basic graph of how many infractions each establishment type received. However, I felt this was not weighted well, because some types are far more common than others. What I really needed was the percentage within each category. This was a much more difficult task, but I managed to pull it together.

The graphs show that restaurants were not the worst culprits; instead it was chartered cruise boats, rest homes, and mobile food preparation premises.
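The per-category weighting described above could be computed roughly like this (only the column names come from `str(dine)`; the object names and the exact grouping are my assumptions):

```r
# Total inspections per establishment type
totals <- dine %>%
  group_by(ESTABLISHMENTTYPE) %>%
  summarise(total = n())

# Inspections per type that recorded an infraction (non-blank detail)
infractions <- dine %>%
  filter(INFRACTION_DETAILS != "") %>%
  group_by(ESTABLISHMENTTYPE) %>%
  summarise(infractions = n())

# Percentage of each category's inspections that carried an infraction
infr_table <- left_join(totals, infractions, by = "ESTABLISHMENTTYPE") %>%
  mutate(infr_p = 100 * infractions / total)
```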

Graphs for the distribution of fines.

Below is the distribution of fines. I zoomed in on a few graphs to have a closer look.

## Source: local data frame [10 x 2]
## 
##                                                             INFRACTION_DETAILS
##                                                                         (fctr)
## 1  Fail to ensure the presence of the holder of a valid food handler's certifi
## 2                             Operator fail to properly wash surfaces in rooms
## 3  Operate food premise - fail to keep facility clean O. Reg  562/90 Sec. 68(2
## 4                               Operator fail to provide adequate pest control
## 5  Fail to protect food from contamination or adulteration O. Reg  562/90 Sec.
## 6                                     Operator fail to properly wash equipment
## 7  Operator fail to ensure equipment surface washed as necessary O. Reg  562/9
## 8  Fail to hold a valid food handler's certificate - Muncipal Code Chapter 545
## 9                 Operator fail to ensure food is not contaminated/adulterated
## 10 Operate food premise maintained in manner adversely affecting sanitary cond
## Variables not shown: n (int)

## Source: local data frame [10 x 2]
## 
##                                                             INFRACTION_DETAILS
##                                                                         (fctr)
## 1           Operator fail to maintain hazardous food(s) at 4C (40F) or colder.
## 2            Operator fail to maintain hazardous foods at 60C (140F) or hotter
## 3                             Operator fail to properly wash surfaces in rooms
## 4  Maintain hazardous foods at internal temperature between 4 C and 60 C O. Re
## 5                                     Operator fail to properly wash equipment
## 6  Operator must ensure that the most recent food safety inspection notice, as
## 7  Store hazardous foods in container at internal temperature above 5 C O. Reg
## 8  Display hazardous foods at internal temperature between 4 C and 60 C O. Reg
## 9                                     Operator fail to properly maintain rooms
## 10                              Operator fail to provide adequate pest control
## Variables not shown: n (int)

## Source: local data frame [10 x 2]
## 
##                                                             INFRACTION_DETAILS
##                                                                         (fctr)
## 1  Operator must ensure that the most recent food safety inspection notice, as
## 2                               Operator fail to provide adequate pest control
## 3  Fail to protect food from contamination or adulteration O. Reg  562/90 Sec.
## 4                 Operator fail to ensure food is not contaminated/adulterated
## 5            Operator fail to maintain hazardous foods at 60C (140F) or hotter
## 6                                Operator fail to prevent a rodent infestation
## 7                         Operator fail to properly maintain mechanical washer
## 8                                     Operator fail to properly wash equipment
## 9                             Operator fail to properly wash surfaces in rooms
## 10                                      Operator fail to provide approved eggs
## Variables not shown: n (int)

There is not much we can take away from these graphs except that fines are generally very low, with a few scattered sporadically above $1,000. The reasons for the fines are similar across all groups, so the size of a fine might reflect how often the restaurant gets in trouble. It is possible that the more often an establishment commits the same violation, the higher the fine goes, or possibly it is the combination of violations.

Geocoding - This is the code I used to make the Graphs

I had to comment out (hashtag) this entire section to get the knit to HTML to work.

What I did here was create a loop that pings Google for the latitude and longitude of each address in the dataset. I could only do 2,500 queries per day for free, so it took a week or two to get through the entire dataset. It took only that long because, even though there are more than 90,000 entries, only just over 11,000 addresses are unique.
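The commented-out loop probably resembled the standard `ggmap::geocode` pattern; this is a sketch, and the object names and the "Toronto, ON" suffix are my assumptions:

```r
# Geocode each unique address once; the free tier allowed ~2,500
# queries per day, so this ran in batches over several days
addresses <- unique(as.character(dine$ESTABLISHMENT_ADDRESS))
geocoded  <- data.frame(address = addresses, lat = NA, long = NA)

for (i in seq_along(addresses)) {
  result <- geocode(paste(addresses[i], "Toronto, ON"), source = "google")
  geocoded$lat[i]  <- result$lat
  geocoded$long[i] <- result$lon
}
```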

See if there is a relation between month and status

In this section I wanted to know whether there were months in which more restaurants failed their health inspection. My initial thought was that there might be more violations during the busy times: summer and Christmas.

## [1] 1.3831

## [1] 7.083333

## [1] 4.328685
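The figures above were presumably produced by tabulating status against the calendar month of the inspection; a rough reconstruction (the exact computation is my guess) would be:

```r
# Share of each status within each calendar month of inspection
dine$month <- format(dine$INSPECTION_DATE, "%m")

month_status <- dine %>%
  filter(ESTABLISHMENT_STATUS != "") %>%
  group_by(month, ESTABLISHMENT_STATUS) %>%
  summarise(n = n()) %>%
  group_by(month) %>%
  mutate(pct = 100 * n / sum(n))
```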

In conclusion, the data showed small humps of closures in the peak demand periods of summer and Christmas, but only for closures, and although the rate appeared higher, the numbers are so low that the difference may not even be significant. Otherwise, the month did not really have an effect on whether a restaurant passed or not.

Restaurants that were closed down in the last two years

A list of the names of restaurants that were closed in the last two years. I did some additional reading: inspectors say they want to keep businesses open, but repeated severe offences force them to close an establishment when previous violations did not change its behaviour.

##  [1] MUCHO BURRITO                     HO SHIM/ACKO LOUNGE              
##  [3] KABSA MANDI                       DOSA DARBAR                      
##  [5] J & C VARIETY                     WOKKING ON WHEELS                
##  [7] STARBUCKS                         TASTE OF TANDOOREE               
##  [9] BOMBAY CHOWPATTY                  WESTWOOD PLACE BURGERS RESTAURANT
## [11] SPENCE'S BAKERY                   BOSTON VARIETY AND FISH          
## [13] MANCHU WOK                        AMSTERDAM GUEST HOME             
## [15] MELEWA BAKERY                     CHAKO BARBEQUE ISAKAYA           
## [17] Karachi Kitchen                   CAKE HOUSE BAKERY                
## [19] YOGI NOODLES                      ALI BABA'S                       
## [21] DRAGON HOUSE CHINESE FOOD         GOLD CIRCLE DAYCARE CENTRE       
## [23] ATWINA MARFO ENTERPRISES          PHO MI VIET HOA RESTAURANT       
## [25] PEARL BAYVIEW CHINESE CUISINE     CULTURES/VANELLIS                
## [27] TEMPLETON CAFE                    STELLA BOREALIS                  
## [29] SPRING ROLLS GO                   GLOUCESTER BAKERY                
## [31] FORTUNE SEAFOOD RESTAURANT        JOE'S BISTRO                     
## [33] WHAT A BAGEL                      CHINESE BAKERY                   
## [35] PERFECT CHINESE RESTAURANT        SUN VIEW BAKERY                  
## [37] FARM FRESH SUPERMARKET            FONG ON FOODS LIMITED            
## [39] SMART CHOICE FOOD MART            AGRA FINE INDIAN CUISINE         
## [41] KIM VIETNAMESE RESTAURANT         RIVIERA BAKERY                   
## [43] TAKIMI SUSHI ASIAN CUISINE        GINGER RESTAURANT                
## 12406 Levels: 'K' STORE ... ZYNG
## 
##            Restaurant                Bakery         Food Take Out 
##                   172                    55                    29 
##           Supermarket     Food Court Vendor Food Processing Plant 
##                    26                    20                    15
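The list of names and the type counts above could be recovered from a subset on status, roughly like this (the object name is mine):

```r
# Establishments that received a "Closed" status, and the most
# common establishment types among them
closed_rest <- subset(dine, ESTABLISHMENT_STATUS == "Closed")
unique(closed_rest$ESTABLISHMENT_NAME)
head(sort(table(closed_rest$ESTABLISHMENTTYPE), decreasing = TRUE))
```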

Just having a look at the list, there appear to be a lot of Asian-themed food places that got the final axe. This will show up more in the final analysis on the maps.

Merging the geocoded data with the old dataset

I finally got through the geocoding and continued on to merge the datasets together, then saved the result to dine_merge, which is what you saw above.
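The merge step is hidden, but with a geocoded address table from the earlier loop it would look roughly like this (the `geocoded` table and its column names are my assumptions):

```r
# Attach lat/long to every inspection row by address, then drop
# any duplicate rows introduced by the merge
dine_merge <- merge(dine, geocoded,
                    by.x = "ESTABLISHMENT_ADDRESS", by.y = "address")
dine_merge <- unique(dine_merge)
```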

Clustering

I decided to give clustering a try to see if it could highlight where in the city the repeat offenders are.

This part sets up the clustering algorithms.

I used two different algorithms because I found k-means worked better for “closed” and dbscan worked well for “conditional pass”.

#For Closed Restaurants: keep one row per unique lat/long pair
# (taking unique() of each column separately would misalign the coordinates)
closed <- subset(dine, dine$ESTABLISHMENT_STATUS == "Closed")
closed <- unique(cbind(closed$geocoded.lat, closed$geocoded.long))
kluster <- kmeans(closed, 11)
closed <- data.frame(closed)
closed$ki <- kluster$cluster

#creating the point size from the cluster sizes
closed_size <- data.frame(kluster$centers)
closed_size$n <- kluster$size

#For Conditional Pass: again de-duplicate the coordinate pairs
cp <- subset(dine, dine$ESTABLISHMENT_STATUS == "Conditional Pass")
cp <- unique(cbind(cp$geocoded.lat, cp$geocoded.long))
kluster_mean <- kmeans(cp, 60)      # k-means centers for the point-sized maps
kluster2 <- dbscan(cp, eps=0.001)   # dbscan clusters for the coloured map
cp <- data.frame(cp)
cp$ki <- as.factor(kluster2$cluster)

#For the second part of CP, with different point sizes
cp_size <- data.frame(kluster_mean$centers)
cp_size$n <- kluster_mean$size

Putting the Cluster on a map for Closed Restaurants

These are the maps I put together for closed restaurants.

They are of the city of Toronto. The different shapes just show different clusters and have no categorical meaning.

The second set of maps have dot sized counts of cluster centers.

(Figure: map zoomed in to North York)

This section showed that the clusters sit over the predominantly Asian neighbourhoods. This again might suggest a communication problem.

Putting the Cluster on a map for Conditional Pass Restaurants

These are the maps I put together for Conditional Pass Restaurants.

The different colours have no categorical meaning. They are just to highlight different clusters.

The second set of graphs took the cluster centers and made dot-sized maps.

The biggest points on the CP map sit over those of the closed map. There are lots of points scattered around Dundas Street, notoriously known for mom-and-pop shops, and along Bloor Street. It is interesting that there are no dots in the financial district and only one by the waterfront. These places usually have high leases and attract an upper-class clientele.

Final Plots and Summaries

Plot 1: In this graph you can see that about 6 to 8% of all inspections result in a conditional pass. This seems high; Toronto restaurants need to do better.

Closures, on the other hand, are very low: less than 0.10% of restaurants.

This seemed odd to me. I read an interview with an inspector who mentioned he will do a lot before he closes a restaurant. From a previous graph, the distribution of the different violations was the same for Closed and Conditional Pass; there was about a three-violation difference between those that got closed and those that got a conditional pass. This may be where the difference lies, or it may lie in the number of conditional passes in a row.

The only notable difference for closed restaurants is that closures happen more often in the middle of summer and just before the holidays.

Note: the totals do not add up to 100% because many records had a blank status.

grid.arrange(p1,p2,p3, 
             top = "What % of restaurants get a 
             specific type of status by month")

Plot 2: As mentioned above, this distribution is interesting because most of the data was for restaurants, yet restaurants are only just above the average. The highest are rest homes, which are old age homes. Old age homes do not appear in the closed data, however; there is probably a reason for that, as closing one would mean moving everyone out. The second highest was mobile food preparation, which also appeared high on the closed graph.

This provided a very interesting result: it hints at mismanagement of old age homes in the city of Toronto.

ggplot(aes(x=reorder(ESTABLISHMENTTYPE,-cp_p), y=cp_p),
       data= subset(full_table, full_table$cp_p > 0)) +
       geom_bar(stat="identity") +
       theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +   
       xlab("Establishment Type") +
       ylab("% of category") +
       ggtitle("Establishment Type that got a Conditional Pass")

Plot 3

I initially wanted to do a density map, but it never looked good. Instead, I used a clustering algorithm to highlight the dense areas, by scaling the size of the dots at the cluster centers. From the data it is easy to see that the same areas that produce closures are also high in conditional passes. The areas with dots are normally thought of as cheaper places to eat and drink, while areas like the financial district do not have any dots. A lot of the data falls in areas of ethnic, and specifically Asian, concentration. This might mean there is a communication barrier around the rules for these establishments.

It does need to be noted that these areas have a high density of restaurants, so to get a true picture the map should be tied to a proportion; that was too difficult for this small project. Having lived in the city of Toronto, I know the major differences appear when moving from downtown to the suburbs, which is why I have two different maps with two different settings, so they can show locational hubs of restaurant violations.

There are no x,y labels as it is a map.

p3<-ggmap(map13, extent="device") +
  geom_point(aes(x = X2, y = X1, size=n), 
             data = cp_size, alpha=0.4)  +
             ggtitle("Map Zoomed in to downtown Toronto")

p4<-ggmap(mapsc, extent="device") + 
  geom_point(aes(x = X2, y = X1, size=n), 
             data = cp_size, alpha=0.5) + 
             ggtitle("Map zoomed in to Scarborough")

p5<-ggmap(map13, extent="device") +
  geom_point(aes(x = X2, y = X1, size = n), 
             data = closed_size)  +
             ggtitle("Map zoomed in to downtown Toronto")

p6<- ggmap(mapsc, extent="device") +
  geom_point(aes(x = X2, y = X1, size = n), 
             data = closed_size)  +
             ggtitle("Map zoomed in to Scarborough")

grid.arrange(p3,p4, p5, p6, 
    top= "City of Toronto Conditional Pass/Closed violation maps")
## Warning: Removed 42 rows containing missing values (geom_point).
## Warning: Removed 56 rows containing missing values (geom_point).
## Warning: Removed 5 rows containing missing values (geom_point).
## Warning: Removed 10 rows containing missing values (geom_point).

Reflection

I may have taken on a different project than other Udacity students. I really wanted to do something local to where I live and something no one had done before. I was a little annoyed when I found that the city has an interactive map of the same data. However, it does not display the analysis that I did, so it was not all for nothing.

One of the areas I struggled with was setting up the chain to format the data the way I needed to build the graphs. It took a fair bit of trial and error to get it the way I wanted it.

My major hurdles were with getting the data geocoded. The limit of 2,500 pings to Google per day really slowed me down and made it very confusing as to where I was in the process. When I finished, I found I had more data points than in my original dataset and that a few hundred unique addresses were missing. To fix this I had to figure out what was missing, build an array to get those addresses geocoded, merge it with the dataset, and then delete identical entries. After all that, I had the same length and the same unique address values.

My last major struggle, where I finally raised the white flag, was with making a density map. I could not get it to look good. I found that fiddling with the point size and alpha produced a more informative map.

The data could be enriched with more years of records and more information about what type of restaurant each establishment is; right now the category just says "Restaurant". Having completed all this work, I found success in learning new things on my own without the help of Udacity, including how to use ggmap, k-means clustering, and writing my own for loops in R. I'm pretty happy with the maps I created and proud that I finally got it done.

A list of websites

I'm sorry; I totally forgot to keep a list of the places where I used information.

I can tell you most of it came from Stack Overflow, and I learned about ggmap from multiple blog posts on how to use it. There were so many I don't even know where to start in terms of finding them again. One blog post in particular helped me write the for loop to get the addresses geocoded; it was very helpful.